Introduction:

This project conducts a thorough exploration and analysis of COVID-19 data related to cases and deaths across Brazil, covering the period from early 2020 to early 2022. The analysis solely utilizes R for data preprocessing, manipulation, and visualisation, aiming to identify key patterns and insights into the pandemic’s impact across different Brazilian regions.

Data Source:

The dataset, sourced from an official Brazilian database, provides detailed records from epidemiological reports across the country.

Data Preprocessing:

The initial step involves cleaning and preparing the data for analysis using R, focusing on selecting necessary variables and correcting data discrepancies.

Exploratory Data Analysis:

This stage takes a deep dive into the data, using statistical techniques and visualisations to uncover temporal and regional trends in case and death rates.

Key Findings:
  1. The analysis reveals significant regional differences in the impact of COVID-19 and the effectiveness of response measures.
  2. Temporal trends indicate critical periods of infection spread and mortality rates.

Conclusion:

The findings underscore the importance of tailored public health responses and provide valuable insights for policymakers and health professionals.

Dictionary of Variables:

Let’s get started!

Data Pre-Processing

Firstly, the libraries that were necessary for this project were loaded. Note that you might need to install the ‘coronabr’ package. Uncomment the first two lines to install ‘remotes’ and ‘coronabr’.
#install.packages('remotes')
#remotes::install_github('liibre/coronabr')

library(coronabr)
library(data.table)
library(dplyr)
library(lubridate)
library(ggplot2)
library(plotly)
library(ggpubr)

In this project, a dataset from an official Brazilian source was used to explore the evolution of Covid19 over a period of 2 years. In this first part, the data was pre-processed; in the next section, it was explored further in order to extract insights.

The dataset was loaded using the package ‘coronabr’. The first rows of the data are displayed below.
data <- get_corona_br(save = FALSE)
head(data)
## # A tibble: 6 × 18
##   city       city_ibge_code date       epidemiological_week estimated_population
##   <chr>               <dbl> <date>                    <dbl>                <dbl>
## 1 Rio Branco        1200401 2020-03-17               202012               413418
## 2 <NA>                   12 2020-03-17               202012               894470
## 3 Rio Branco        1200401 2020-03-18               202012               413418
## 4 <NA>                   12 2020-03-18               202012               894470
## 5 Rio Branco        1200401 2020-03-19               202012               413418
## 6 <NA>                   12 2020-03-19               202012               894470
## # ℹ 13 more variables: estimated_population_2019 <dbl>, is_last <lgl>,
## #   is_repeated <lgl>, last_available_confirmed <dbl>,
## #   last_available_confirmed_per_100k_inhabitants <dbl>,
## #   last_available_date <date>, last_available_death_rate <dbl>,
## #   last_available_deaths <dbl>, order_for_place <dbl>, place_type <chr>,
## #   state <fct>, new_confirmed <dbl>, new_deaths <dbl>

As we can see above, this dataset has a lot of variables. Only a few of them were needed for this project, namely ‘date’, ‘state’, ‘estimated_population’, ‘new_confirmed’ and ‘new_deaths’, so these were selected next. The variable ‘city’ was also included because, according to the dictionary of variables, rows with NA in this variable represent state-level data, which made it useful for filtering. Later on, this variable was removed as well.

However, before selecting the variables of interest, a copy of the original dataset was created so that any information needed later on could be recovered without loading the dataset again.
original_data <- copy(data)
data <- data %>%
  select(date, city, state, estimated_population, new_confirmed, new_deaths)
head(data)
## # A tibble: 6 × 6
##   date       city       state estimated_population new_confirmed new_deaths
##   <date>     <chr>      <fct>                <dbl>         <dbl>      <dbl>
## 1 2020-03-17 Rio Branco AC                  413418             3          0
## 2 2020-03-17 <NA>       AC                  894470             3          0
## 3 2020-03-18 Rio Branco AC                  413418             0          0
## 4 2020-03-18 <NA>       AC                  894470             0          0
## 5 2020-03-19 Rio Branco AC                  413418             1          0
## 6 2020-03-19 <NA>       AC                  894470             1          0
As the focus of this study was the analysis of the states and the entire country, only the state-level data were kept. To do this, only the rows with NA in the ‘city’ variable were retained (these rows refer to states, as mentioned in the dictionary of variables). Following that, the variable ‘city’ was removed.
data <- data[is.na(data$city),]
data <- select(data, -city)
head(data)
## # A tibble: 6 × 5
##   date       state estimated_population new_confirmed new_deaths
##   <date>     <fct>                <dbl>         <dbl>      <dbl>
## 1 2020-03-17 AC                  894470             3          0
## 2 2020-03-18 AC                  894470             0          0
## 3 2020-03-19 AC                  894470             1          0
## 4 2020-03-20 AC                  894470             3          0
## 5 2020-03-21 AC                  894470             4          0
## 6 2020-03-22 AC                  894470             0          0
With the data filtered and reorganised, the next step was to check the dimensions of the data (number of records and number of variables).
dim(data)
## [1] 20119     5
The filtered dataset contained 20,119 records and 5 variables: ‘date’, ‘state’, ‘estimated_population’, ‘new_confirmed’ and ‘new_deaths’. Next, the data types of the variables were checked.
glimpse(data)
## Rows: 20,119
## Columns: 5
## $ date                 <date> 2020-03-17, 2020-03-18, 2020-03-19, 2020-03-20, …
## $ state                <fct> AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, A…
## $ estimated_population <dbl> 894470, 894470, 894470, 894470, 894470, 894470, 8…
## $ new_confirmed        <dbl> 3, 0, 1, 3, 4, 0, 6, 4, 2, 0, 2, 0, 9, 7, 1, 1, 2…
## $ new_deaths           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
The data types of the variables made sense and had been loaded correctly by R. Next, the number of missing values (NA) in each variable was checked.
lapply(data, function(x) { sum(is.na(x)) })
## $date
## [1] 0
## 
## $state
## [1] 0
## 
## $estimated_population
## [1] 0
## 
## $new_confirmed
## [1] 0
## 
## $new_deaths
## [1] 0
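
As a side note, the per-column NA count can also be written more compactly. The sketch below uses a small made-up data frame (`toy` and its values are assumptions for illustration, not the real dataset) and relies only on base R.

```r
# Toy data frame standing in for the real dataset (values are made up).
toy <- data.frame(
  state = c("AC", "AL", NA),
  new_confirmed = c(3, NA, 1),
  new_deaths = c(0, 0, 0)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```

The result is a named numeric vector with one entry per column, which is easier to scan than a list when the dataset has many variables.
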
As no variable had missing values, the data was ready for the exploratory analysis. Before that, to facilitate later analysis, a new variable was created to hold the month separately. A variable for the day was not created because, in this study, the variables were only analysed over periods of a month or longer (with a few weekly exceptions shown later). A variable for the year was not created either, because 2021 is the only year with data for all twelve months.
# Add the month number and place it right after 'date'
data <- data %>%
  mutate(month = month(date)) %>%
  relocate(month, .after = date)

Exploratory Analysis

First of all the maximum and minimum values for the variable ‘date’ were checked to see if there were 2 years of data for this study.
data %>%
  select(date) %>%
  summarise(
    min_date = min(date),
    max_date = max(date)
  )
## # A tibble: 1 × 2
##   min_date   max_date  
##   <date>     <date>    
## 1 2020-02-25 2022-03-27
In order to use the most up to date information, a period of 2 years ending on 27/03/2022 was filtered.
data <- data[data$date >= ymd('2020-03-28'),]
After this filter, the data covered the period from 28/03/2020 to 27/03/2022. Next, the ‘state’ variable was inspected to check for inconsistencies and to confirm that it contained 27 values (26 states and 1 federal district).
data %>%
  select(state) %>%
  unique()
## # A tibble: 27 × 1
##    state
##    <fct>
##  1 AC   
##  2 AL   
##  3 AM   
##  4 AP   
##  5 BA   
##  6 CE   
##  7 DF   
##  8 ES   
##  9 GO   
## 10 MA   
## # ℹ 17 more rows
The variable ‘state’ presented no problems and all the states were correctly included. The next step was to check the variable ‘estimated_population’.
data %>%
  select(state, estimated_population) %>%
  unique()
## # A tibble: 27 × 2
##    state estimated_population
##    <fct>                <dbl>
##  1 AC                  894470
##  2 AL                 3351543
##  3 AM                 4207714
##  4 AP                  861773
##  5 BA                14930634
##  6 CE                 9187103
##  7 DF                 3055149
##  8 ES                 4064052
##  9 GO                 7113540
## 10 MA                 7114598
## # ℹ 17 more rows
These values were cross-checked against the estimates published on the website of IBGE (the official census organisation) and were all correct. The next step was to look for inconsistencies in the variables new_confirmed and new_deaths. Negative values do not make sense for these variables, as the numbers of new confirmed cases and new deaths should be 0 or a positive integer.
data %>%
  filter(new_confirmed<0) %>%
  select(new_confirmed)
## # A tibble: 23 × 1
##    new_confirmed
##            <dbl>
##  1           -17
##  2           -23
##  3         -2845
##  4        -12028
##  5           -72
##  6            -9
##  7          -507
##  8           -25
##  9         -8246
## 10           -19
## # ℹ 13 more rows
data %>% 
  filter(new_deaths<0) %>%
  select(new_deaths)
## # A tibble: 34 × 1
##    new_deaths
##         <dbl>
##  1         -1
##  2         -3
##  3         -2
##  4         -2
##  5         -6
##  6         -1
##  7         -2
##  8         -1
##  9         -6
## 10         -1
## # ℹ 24 more rows
As observed above, both variables contained a few negative values (23 rows for new_confirmed and 34 rows for new_deaths). This was handled by replacing each negative value with the positive value of the same magnitude, on the assumption that the minus sign was included by mistake. After that, both variables were re-checked for negative values.
# Replace negative counts with their absolute values
# (assumes the minus sign is a data-entry error). abs() is vectorised,
# so every row is covered, including the first one.
data <- data %>%
  mutate(new_confirmed = abs(new_confirmed),
         new_deaths = abs(new_deaths))

data %>%
  filter(new_confirmed<0) %>%
  select(new_confirmed)
## # A tibble: 0 × 1
## # ℹ 1 variable: new_confirmed <dbl>
data %>% 
  filter(new_deaths<0) %>%
  select(new_deaths)
## # A tibble: 0 × 1
## # ℹ 1 variable: new_deaths <dbl>
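
The two negative-value checks can also be run in a single pass. The sketch below is an illustration on made-up numbers (the `toy` tibble is an assumption, not the real data) and uses `across()` from dplyr to count negative entries in both columns before and after the sign fix.

```r
library(dplyr)

# Toy data with a few negative entries (values are made up).
toy <- tibble(
  new_confirmed = c(3, -17, 0, -23),
  new_deaths = c(0, -1, 2, 0)
)

# Count the negative values in both columns at once.
negatives_before <- toy %>%
  summarise(across(c(new_confirmed, new_deaths), ~ sum(.x < 0)))

# Apply the same sign fix to both columns, then count again.
negatives_after <- toy %>%
  mutate(across(c(new_confirmed, new_deaths), abs)) %>%
  summarise(across(c(new_confirmed, new_deaths), ~ sum(.x < 0)))

print(negatives_before)
print(negatives_after)
```
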
With the problem solved, the analysis can start. Let’s first have a glance at the overall situation by looking at the numbers for the entire country. In the following step, the estimated population of the entire country, the total numbers of new confirmed cases and new deaths, the proportion of the population with a new confirmed case, the proportion of the population that died, and the ratio of deaths to confirmed cases were calculated and saved in a new dataframe named ‘data_country_all’.
data_country_all <- data %>%
  summarise(period = '28/03/2020 to 27/03/2022',
            country = 'Brazil',
            estimated_population = sum(unique(estimated_population)),
            new_confirmed = sum(new_confirmed),
            new_deaths = sum(new_deaths)) %>%
  mutate(prop_confirmed_population = new_confirmed/estimated_population,
         prop_deaths_population = new_deaths/estimated_population,
         deaths_confirmed_rate = new_deaths/new_confirmed)
data_country_all
## # A tibble: 1 × 8
##   period                   country estimated_population new_confirmed new_deaths
##   <chr>                    <chr>                  <dbl>         <dbl>      <dbl>
## 1 28/03/2020 to 27/03/2022 Brazil             211755692      29904964     659726
## # ℹ 3 more variables: prop_confirmed_population <dbl>,
## #   prop_deaths_population <dbl>, deaths_confirmed_rate <dbl>
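
One detail of the population total is worth a sketch: `sum(unique(estimated_population))` relies on every state having a different population value. That holds for the 27 values in this dataset, but the toy example below (state codes and numbers are hypothetical) shows what would happen with a tie, and a variant that deduplicates by state instead.

```r
library(dplyr)

# Hypothetical states; AA and BB happen to share the same population.
toy <- tibble(
  state = c("AA", "AA", "BB", "BB", "CC"),
  estimated_population = c(1000, 1000, 1000, 1000, 500)
)

# unique() keeps a single 1000, so one state is silently dropped.
pop_unique <- sum(unique(toy$estimated_population))

# Keeping one row per state first avoids the problem.
pop_by_state <- toy %>%
  distinct(state, estimated_population) %>%
  summarise(total = sum(estimated_population)) %>%
  pull(total)

print(c(pop_unique = pop_unique, pop_by_state = pop_by_state))
```
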

Between 28/03/2020 and 27/03/2022, 29,904,964 new confirmed cases and 659,726 deaths due to Covid19 were recorded in Brazil. This means that approximately 14.1% of the Brazilian population had a confirmed case and approximately 0.3% died. The ratio of deaths to confirmed cases was about 2.2%.
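
The percentages quoted for the whole country can be reproduced directly from the three totals printed above. This is only a worked check of the arithmetic; the variable names are chosen for illustration.

```r
# Totals taken from the country-level summary output.
estimated_population <- 211755692
new_confirmed <- 29904964
new_deaths <- 659726

# Each ratio expressed as a percentage and rounded to one decimal place.
pct_confirmed <- round(new_confirmed / estimated_population * 100, 1)
pct_deaths <- round(new_deaths / estimated_population * 100, 1)
pct_deaths_confirmed <- round(new_deaths / new_confirmed * 100, 1)

print(c(pct_confirmed, pct_deaths, pct_deaths_confirmed))
```
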

Next, the evolution of the Covid19 cases was evaluated. As the number of cases per day varies a lot, a graph built directly from the daily data would be very noisy. Therefore, new confirmed cases and new deaths were summed by week and stored in a new dataframe named data_country_week, which was used to support this analysis.
data_country_week <- data
data_country_week$week <- floor_date(data$date, "week")

data_country_week <- data_country_week %>%
  group_by(week) %>%
  summarise(estimated_population = sum(unique(estimated_population)),
            new_confirmed = sum(new_confirmed),
            new_deaths = sum(new_deaths)) %>%
  mutate(prop_confirmed_population = new_confirmed/estimated_population,
         prop_deaths_population = new_deaths/estimated_population,
         deaths_confirmed_rate = new_deaths/new_confirmed) %>%
ungroup()
head(data_country_week)
## # A tibble: 6 × 7
##   week       estimated_population new_confirmed new_deaths
##   <date>                    <dbl>         <dbl>      <dbl>
## 1 2020-03-22            211755692           477         22
## 2 2020-03-29            211755692          6428        330
## 3 2020-04-05            211755692         10610        696
## 4 2020-04-12            211755692         16184       1234
## 5 2020-04-19            211755692         22331       1699
## 6 2020-04-26            211755692         38653       2736
## # ℹ 3 more variables: prop_confirmed_population <dbl>,
## #   prop_deaths_population <dbl>, deaths_confirmed_rate <dbl>
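
To see why the first weekly row is dated 2020-03-22 even though the filtered data starts on 2020-03-28, it helps to look at what `floor_date()` does to a few dates. The sketch below passes `week_start = 7` (Sunday) explicitly, which is assumed to match the default used in the analysis.

```r
library(lubridate)

# 2020-03-28 is a Saturday, 2020-03-29 a Sunday, 2020-04-04 a Saturday.
dates <- ymd(c("2020-03-28", "2020-03-29", "2020-04-04"))

# Each date is rounded down to the Sunday that starts its week.
weeks <- floor_date(dates, "week", week_start = 7)
print(weeks)
```

Because 28/03/2020 is the last day of its week, the first weekly bucket holds a single day of data, which may explain why its totals are so much lower than those of the following weeks. The same applies to the final bucket, since 27/03/2022 is a Sunday.
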
In the plot below you can see the evolution of the number of new confirmed cases and new deaths caused by Covid19 in Brazil over the 2-year study period. Note that the left axis shows new confirmed cases in units of 10,000 and the right axis shows new deaths in units of 1,000. Therefore, a value read from the graph should be multiplied by 10,000 or 1,000, depending on which variable you are looking at.
colors <- c('Death' = 'red', 'Confirmed' = 'blue', 'Death/Confirmed Rate' = 'black')
x_annotation <- ymd('2020-01-01')

ggplot(data=data_country_week) +
  geom_line(mapping = aes(x=week, y=new_confirmed / 10000, color='Confirmed')) +
  geom_line(mapping = aes(x=week, y=new_deaths / 1000, color='Death')) +
  annotate(geom='label', x=x_annotation, y=114, label='BRASIL (total)
  Period: 28/03/2020 to 27/03/2022
  Confirmed Cases: 29,904,964
  Deaths: 659,726', color='black', hjust = 0, fontface='bold', size=4) +
  labs(x = 'Week', color = 'Legend') +
  ggtitle('Evolution of Confirmed and Death Rates in Brazil') +
  scale_y_continuous(
    'Number of New Confirmed Cases (x 10,000)', 
    sec.axis = sec_axis(~ . * 1, name = 'Number of New Deaths (x 1,000)')) +
  scale_color_manual(values = colors) +
  theme(
    plot.title = element_text(size = 14, hjust = 0.5, face = 'bold'),
    axis.title = element_text(size = 11,face = 'bold'),
    axis.text = element_text(size = 10),
    axis.title.y.right = element_text(color = 'red'),
    axis.title.y.left = element_text(color = 'blue'),
    axis.text.y.right = element_text(color = 'red'),
    axis.text.y.left = element_text(color = 'blue'),
    legend.text = element_text(size = 7),
    legend.title = element_text(size = 8, face = 'bold'),
    legend.position = c(0.1, 0.6),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(colour = 'black', fill = '#F5F5F5'),
    panel.border = element_rect(size = 1, fill = NA)
  )

As shown above, the number of new confirmed cases peaked in summer and spring of 2021 and again in summer of 2022. The 2022 peak reached approximately 1,300,000 (130 x 10,000) new confirmed cases per week but lasted a shorter period than the 2021 peak. The number of new deaths peaked (at about 20,000 new deaths per week) in spring of 2021, coinciding with the 2021 peak of new confirmed cases. Although the 2022 peak of new confirmed cases was more than double the 2021 peak, the number of new deaths in 2022 was dramatically lower than during the 2021 peak.

The following graph illustrates the Death/Confirmed Rate of Covid19 in Brazil for the period from 28/03/2020 to 27/03/2022.
x_annotation <- ymd('2021-10-01')

ggplot(data=data_country_week) +
  geom_line(mapping = aes(x=week, y=deaths_confirmed_rate * 100, color='Death/Confirmed Rate')) +
  annotate(geom='label', x=x_annotation, y=6.85, label='BRASIL (total)
Period: 03/2020 to 03/2022
Death/Confirmed Rate: 2.2%', color='black', hjust = 0, fontface='bold', size=4) +
  labs(x = 'Week', y='Death/Confirmed Rate (%)', color = 'Legend') +
  ggtitle('Evolution of Death/Confirmed Rate in Brazil') +
  scale_color_manual(values = colors) +
  theme(
    plot.title = element_text(size = 14, hjust = 0.5, face = 'bold'),
    axis.title = element_text(size=11,face='bold'),
    axis.text = element_text(size=10),
    legend.text = element_text(size = 7),
    legend.title = element_text(size = 8, face = 'bold'),
    legend.position = c(0.89, 0.688),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(colour = 'black'),
    panel.border = element_rect(size = 1, fill = NA)
  )

This plot confirmed that the ratio of new deaths to new confirmed cases was much lower at the end of the assessed period. It also showed that, even though new confirmed cases and deaths peaked in 2021 and 2022, the highest death/confirmed rate was observed right at the beginning of the pandemic, when it nearly reached 8%. This probably happened because little was known about the disease at the time and the country was not well prepared to fight it, but the response improved a lot over time.

Now, let’s dive a little deeper by looking at data aggregated by state. First, a new dataset, data_state, was created with the total numbers of new confirmed cases and new deaths.
data_state <- data %>%
  group_by(state) %>%
  summarise(estimated_population = sum(unique(estimated_population)),
            new_confirmed = sum(new_confirmed),
            new_deaths = sum(new_deaths)) %>%
  mutate(prop_confirmed_population = new_confirmed/estimated_population,
         prop_deaths_population = new_deaths/estimated_population,
         deaths_confirmed_rate = new_deaths/new_confirmed) %>%
ungroup()
As there are a lot of states to compare, let’s create a new variable with the regions, so that the 5 regions of Brazil can be visualised and compared more easily.
state <- c('AC', 'AL', 'AP', 'AM', 'BA', 'CE', 'DF', 'ES', 'GO', 'MA', 'MT', 'MS', 'MG', 'PA', 'PB', 'PR', 'PE', 'PI', 'RJ', 'RN', 'RS', 'RO', 'RR', 'SC', 'SP', 'SE', 'TO')
region <- c('N','NE','N','N','NE','NE','CO','SE','CO','NE','CO','CO','SE','N','NE','S','NE','NE','SE','NE','S','N','N','S','SE','NE','N')
region_state_list <- data.frame(region, state)
# Look up each state's region by matching its code against region_state_list
data_state$region <- region_state_list$region[match(data_state$state, region_state_list$state)]

# Move 'region' to the front so it sits next to 'state'
data_state <- data_state %>%
  relocate(region, .before = state)
head(data_state)
## # A tibble: 6 × 8
##   region state estimated_population new_confirmed new_deaths
##   <chr>  <fct>                <dbl>         <dbl>      <dbl>
## 1 N      AC                  894470        123817       1994
## 2 NE     AL                 3351543        295960       6869
## 3 N      AM                 4207714        580989      14156
## 4 N      AP                  861773        160325       2122
## 5 NE     BA                14930634       1529977      29658
## 6 NE     CE                 9187103       1269210      26725
## # ℹ 3 more variables: prop_confirmed_population <dbl>,
## #   prop_deaths_population <dbl>, deaths_confirmed_rate <dbl>
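
The state-to-region mapping is a plain lookup, which can also be sketched with a named character vector. The example below uses only five of the 27 pairs from the lists above; `region_lookup` and `states` are names made up for this illustration.

```r
# Named vector: names are state codes, values are regions (subset for brevity).
region_lookup <- c(AC = "N", AL = "NE", DF = "CO", SP = "SE", SC = "S")

# The real 'state' column is a factor, so the toy input is a factor too.
states <- factor(c("SP", "AC", "SC", "DF"))

# Convert to character first: indexing with a factor would use its integer
# codes instead of its labels and return the wrong regions.
regions <- unname(region_lookup[as.character(states)])
print(regions)
```
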
Now, let’s create a visualisation to better understand the data by state.
gg <- ggplot(data_state) +
  geom_point(aes(x=prop_confirmed_population * 100,
                 y=prop_deaths_population * 100,
                 color=region,
                 size=deaths_confirmed_rate * 100,
                 group=state)) +
  theme_bw() +
  xlab("New Confirmed Case Rate (%)") +
  ylab("New Deaths Rate (%)") +
  ggtitle("Proportion of New Confirmed Case and Deaths Rates per State") +
  labs(color = 'Region') +
  guides(size = 'none') +
  theme(
    plot.title = element_text(size = 14, hjust = 0.5, face = 'bold'),
    axis.title = element_text(size=11,face='bold'),
    axis.text = element_text(size=10),
    legend.text = element_text(size = 7),
    legend.title = element_text(size = 8, face = 'bold'),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(colour = 'black'),
    panel.border = element_rect(size = 1, fill = NA))

ggplotly(gg) %>%
  highlight("plotly_hover")

The plot above shows one bubble for each state of Brazil, coloured by region. The rate of new confirmed cases is plotted on the x axis and the rate of new deaths on the y axis, while the size of the bubbles represents the death/confirmed rate.

Overall, the regions with the highest new confirmed rate were the south (S) and central west (CO), whereas the highest new deaths rate was observed for the south east (SE) and CO. The death/confirmed rate was also highest for SE. The NE presented the lowest new confirmed case rate and the lowest new deaths rate but, interestingly, its death/confirmed rate was mostly at a medium level, ranging from 1.64% to 2.56%.

When looking at the states, it is clear that the two big pink bubbles at the top of the plot (‘São Paulo’, SP, and ‘Rio de Janeiro’, RJ) were the ones with the highest death/confirmed rate (3.19% and 3.49%, respectively), while ‘Santa Catarina’ (SC) presented the lowest (1.29%). The states with the highest new deaths rate were RJ and ‘Mato Grosso’ (MT), with values above 0.4%. With respect to the new confirmed cases rate, the highest values were observed for ‘Espírito Santo’ (ES) and ‘Roraima’ (RR) (above 25%), both of which presented a relatively low death/confirmed rate (< 1.4%).

Let’s create a new dataset named data_state_week to further investigate the evolution of Covid19 per state, although the plot will probably be very cluttered, as there are 27 states.
data_state_week <- data
data_state_week$week <- floor_date(data$date, "week")

data_state_week <- data_state_week %>%
  group_by(state, week) %>%
  summarise(estimated_population = sum(unique(estimated_population)),
            new_confirmed = sum(new_confirmed),
            new_deaths = sum(new_deaths)) %>%
  mutate(prop_confirmed_population = new_confirmed/estimated_population,
         prop_deaths_population = new_deaths/estimated_population,
         deaths_confirmed_rate = new_deaths/new_confirmed) %>%
ungroup()
head(data_state_week)
## # A tibble: 6 × 8
##   state week       estimated_population new_confirmed new_deaths
##   <fct> <date>                    <dbl>         <dbl>      <dbl>
## 1 AC    2020-03-22               894470             0          0
## 2 AC    2020-03-29               894470            21          0
## 3 AC    2020-04-05               894470            26          2
## 4 AC    2020-04-12               894470            70          4
## 5 AC    2020-04-19               894470           116          5
## 6 AC    2020-04-26               894470           295         11
## # ℹ 3 more variables: prop_confirmed_population <dbl>,
## #   prop_deaths_population <dbl>, deaths_confirmed_rate <dbl>
ggplot(data=data_state_week) +
  geom_line(mapping = aes(x=week, y=new_confirmed, group=state ,color=state)) +
  labs(x = 'Week', y='Number of New Confirmed Cases') +
  ggtitle('Evolution of New Confirmed Cases per State') +

  theme(
    plot.title = element_text(size = 14, hjust = 0.5, face = 'bold'),
    axis.title = element_text(size=11,face='bold'),
    axis.text = element_text(size=10),
    legend.text = element_text(size = 7),
    legend.title = element_text(size = 8, face = 'bold'),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(colour = 'black'),
    panel.border = element_rect(size = 1, fill = NA)
  )

As predicted, it is not possible to perform any analysis with this chart. It would be better to plot only a few states, chosen for their relevance. Instead, let’s create a new dataframe named data_region_week and see if we get a better visualisation to compare the regions.
data_region_week <- data

# Look up each row's region by matching its state code against region_state_list
data_region_week$region <- region_state_list$region[match(data_region_week$state, region_state_list$state)]

data_region_week$week <- floor_date(data$date, "week")

data_region_week <- data_region_week %>%
  group_by(region, week) %>%
  summarise(estimated_population = sum(unique(estimated_population)),
            new_confirmed = sum(new_confirmed),
            new_deaths = sum(new_deaths)) %>%
  mutate(prop_confirmed_population = new_confirmed/estimated_population,
         prop_deaths_population = new_deaths/estimated_population,
         deaths_confirmed_rate = new_deaths/new_confirmed) %>%
ungroup()

data_region_week <- data_region_week %>%
  reframe(week, region, estimated_population, new_confirmed, new_deaths, prop_confirmed_population, prop_deaths_population, deaths_confirmed_rate)

head(data_region_week)
## # A tibble: 6 × 8
##   week       region estimated_population new_confirmed new_deaths
##   <date>     <chr>                 <dbl>         <dbl>      <dbl>
## 1 2020-03-22 CO                 16504303            29          0
## 2 2020-03-29 CO                 16504303           323         10
## 3 2020-04-05 CO                 16504303           341         18
## 4 2020-04-12 CO                 16504303           484         23
## 5 2020-04-19 CO                 16504303           483         16
## 6 2020-04-26 CO                 16504303          1040         15
## # ℹ 3 more variables: prop_confirmed_population <dbl>,
## #   prop_deaths_population <dbl>, deaths_confirmed_rate <dbl>
Plotting the new confirmed cases rate per region per week.
ggplot(data=data_region_week) +
  geom_line(mapping = aes(x=week, y=prop_confirmed_population * 100, group=region ,color=region)) +
  labs(x = 'Week', y='New Confirmed Cases Rate (%)') +
  ggtitle('Evolution of New Confirmed Cases per Region') +

  theme(
    plot.title = element_text(size = 14, hjust = 0.5, face = 'bold'),
    axis.title = element_text(size=11,face='bold'),
    axis.text = element_text(size=10),
    legend.text = element_text(size = 7),
    legend.title = element_text(size = 8, face = 'bold'),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(colour = 'black'),
    panel.border = element_rect(size = 1, fill = NA)
  )

The plot has improved compared to the previous version, but it is still quite noisy, which makes it difficult to analyse. So, let’s go one level up and investigate the region data by month. Let’s create a new dataframe named data_region_month and see if it is possible to get some insights from it.
data_region_month <- data

# Look up each row's region by matching its state code against region_state_list
data_region_month$region <- region_state_list$region[match(data_region_month$state, region_state_list$state)]

data_region_month$month <- floor_date(data$date, "month")

data_region_month <- data_region_month %>%
  group_by(region, month) %>%
  summarise(estimated_population = sum(unique(estimated_population)),
            new_confirmed = sum(new_confirmed),
            new_deaths = sum(new_deaths)) %>%
  mutate(prop_confirmed_population = new_confirmed/estimated_population,
         prop_deaths_population = new_deaths/estimated_population,
         deaths_confirmed_rate = new_deaths/new_confirmed) %>%
ungroup()

data_region_month <- data_region_month %>%
  reframe(month, region, estimated_population, new_confirmed, new_deaths, prop_confirmed_population, prop_deaths_population, deaths_confirmed_rate)

head(data_region_month)
## # A tibble: 6 × 8
##   month      region estimated_population new_confirmed new_deaths
##   <date>     <chr>                 <dbl>         <dbl>      <dbl>
## 1 2020-03-01 CO                 16504303           142          4
## 2 2020-04-01 CO                 16504303          2290         74
## 3 2020-05-01 CO                 16504303         14798        304
## 4 2020-06-01 CO                 16504303         81549       1415
## 5 2020-07-01 CO                 16504303        154250       3585
## 6 2020-08-01 CO                 16504303        184565       4003
## # ℹ 3 more variables: prop_confirmed_population <dbl>,
## #   prop_deaths_population <dbl>, deaths_confirmed_rate <dbl>
Plotting the new confirmed cases rate per region per month.
ggplot(data=data_region_month) +
  geom_line(mapping = aes(x=month, y=prop_confirmed_population * 100, group=region ,color=region)) +
  labs(x = 'Month', y='New Confirmed Cases Rate (%)') +
  ggtitle('Evolution of New Confirmed Cases per Region per Month') +

  theme(
    plot.title = element_text(size = 14, hjust = 0.5, face = 'bold'),
    axis.title = element_text(size=11,face='bold'),
    axis.text = element_text(size=10),
    legend.text = element_text(size = 7),
    legend.title = element_text(size = 8, face = 'bold'),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(colour = 'black'),
    panel.border = element_rect(size = 1, fill = NA)
  )

Now the plot is less cluttered and we can see the curve for each region much better. It also confirms what was previously mentioned: most of the time, the S presented the highest rate of new confirmed cases, followed by the CO, whereas the NE presented the lowest. Moreover, the highest peaks were observed for the S (about 3%) and the CO (about 2.1%) during the first months of 2022.

In the following step, let’s investigate the new deaths rate per region per month.
ggplot(data=data_region_month) +
  geom_line(mapping = aes(x=month, y=prop_deaths_population * 100, group=region ,color=region)) +
  labs(x = 'Month', y='New Deaths Rate (%)') +
  ggtitle('Evolution of New Deaths per Region per Month') +

  theme(
    plot.title = element_text(size = 14, hjust = 0.5, face = 'bold'),
    axis.title = element_text(size=11,face='bold'),
    axis.text = element_text(size=10),
    legend.text = element_text(size = 7),
    legend.title = element_text(size = 8, face = 'bold'),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(colour = 'black'),
    panel.border = element_rect(size = 1, fill = NA)
  )

For the new deaths rate, it is interesting to note that the north (N) region reached a similar peak at the beginning of the pandemic and in summer-spring time. All the other regions followed a similar pattern. The NE region again presented the lowest new deaths rate during the studied period, whereas the S, CO and SE presented the highest.

In the next step, let’s investigate the death/confirmed rate per region per month.
ggplot(data=data_region_month) +
  geom_line(mapping = aes(x=month, y=deaths_confirmed_rate * 100, group=region ,color=region)) +
  labs(x = 'Month', y='Death/Confirmed Rate (%)') +
  ggtitle('Evolution of Deaths/Confirmed Rate per Region per Month') +

  theme(
    plot.title = element_text(size = 14, hjust = 0.5, face = 'bold'),
    axis.title = element_text(size=11,face='bold'),
    axis.text = element_text(size=10),
    legend.text = element_text(size = 7),
    legend.title = element_text(size = 8, face = 'bold'),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(colour = 'black'),
    panel.border = element_rect(size = 1, fill = NA)
  )

In this last plot, it is clear that the SE presented the highest death/confirmed rate for nearly the entire period studied. It is also interesting to note that the S and CO regions had their highest death/confirmed rate in summer-spring of 2021 rather than at the beginning of the pandemic, unlike all the other regions and Brazil as a whole.

Finally, as data were available for the whole of 2021, the number of deaths due to Covid19 and the death/confirmed rate for that year were calculated to compare its severity with other common causes of death in Brazil.
data %>%
  filter(date >= ymd('2021-01-01'), date <= ymd('2021-12-31')) %>%
  summarise(total_confirmed = sum(new_confirmed),
            total_death = sum(new_deaths),
            death_rate = sum(new_deaths) / sum(new_confirmed) * 100)
## # A tibble: 1 × 3
##   total_confirmed total_death death_rate
##             <dbl>       <dbl>      <dbl>
## 1        14649109      424629       2.90
As shown above, Brazil recorded 14,649,109 new confirmed cases and 424,629 deaths from Covid-19 in 2021, which translates to a death/confirmed rate of approximately 2.9%. The Brazilian Institute of Geography and Statistics (IBGE) reported that in 2021, Covid-19 was responsible for 26.6% of all deaths, making it the leading cause of death. Diseases related to the circulatory system followed with 20.6% of deaths, and tumors were the third most common cause, accounting for 12.8%. While Covid-19 had a significant impact, the mortality rates from several other global diseases were considerably higher. Furthermore, the United Nations has highlighted that diseases linked to inadequate drinking water, sanitation, and hygiene are leading to preventable fatalities; in 2019, these deficiencies were estimated to account for about 69% of deaths from diarrhea globally.

Key Findings

  1. Although Covid19 was able to mutate and spread quickly, the death/confirmed rate was relatively low (about 2.2% over the studied period).
  2. In Brazil, new confirmed cases peaked in summer of 2022 and new deaths peaked in spring of 2021, whereas the highest death/confirmed rate occurred at the beginning of the pandemic.
  3. Cases of Covid19 were more lethal in the South East (SE) of Brazil, especially in the states of ‘Rio de Janeiro’ and ‘São Paulo’.